Skip to content

Enable guest tracing #695

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 16 commits into from
Jul 23, 2025
Merged

Enable guest tracing #695

merged 16 commits into from
Jul 23, 2025

Conversation

dblnz
Copy link
Contributor

@dblnz dblnz commented Jul 7, 2025

Description

This PR updates the work @syntactically has done in #103 to latest main and extends it to support tracing the guest for performance.
This effort is part of the #669 issue.
Improvements on the trace file output will be addressed in #704 .

How it works

It defines a TraceBuffer in the guest that holds TraceRecords. These records are only sent to the host when the buffer gets full, or an exit (Out/Halt instructions) is anticipated.
The records are sent to the host using an Out instruction and providing the pointer to an array of TraceRecords.
The host accesses the guest memory, reads these records and writes them to the trace file.

The guest uses the Invariant TSC and rdtsc instruction to read the clock cycles in the guest and calculate the timestamp on the host.
The timestamp is relative to when the TraceInfo started (before creating the Hypervisor Driver in this case).

Two additional crates define the guest logic for tracing. One defines some macros meant to make the annotation of functions simpler to the user

#[hyperlight_guest_tracing_macro::trace_function]
fn function_to_trace() {}

Note: The data processing could be improved (maybe OTEL), I've just used what we had at hand.

Results

Below are some results I have generated based on our hello-world example in the hyperlight repository.

1. My host machine on WSL2 running KVM on Intel(R) Core(TM) Ultra 7 165H

Has Invariant TSC

[338µs    ] BLAKE3 hash of binary is 05e1c8a51d535507af833a5f4e5e4033a42f4e395cdba4cfe7453dcffe3ea76d
[58.197ms ] > entrypoint
[58.294ms ]   > load_gdt
[58.314ms ]   < load_gdt
[58.314ms ]   > load_idt
[58.328ms ]     > init_idt
[58.356ms ]     < init_idt
[58.356ms ]   < load_idt
[58.44ms  ]   hyperlight_main
[58.574ms ] > halt
[59.385ms ] > dispatch_function
[59.386ms ]   > internal_dispatch_function
[59.408ms ]     > try_pop_shared_input_data_into
[59.423ms ]       Start converting buffer
[59.566ms ]       Finish converting buffer
[59.567ms ]     < try_pop_shared_input_data_into
[59.567ms ]     > call_guest_function
[59.567ms ]       Calling guest function
[59.628ms ]       > call_host_function
[59.628ms ]         > call_host_function_without_returning_result
[59.671ms ]           > push_shared_output_data
[59.683ms ]             Start copy of data
[59.683ms ]             Finish copy of data
[59.683ms ]           < push_shared_output_data
[59.683ms ]           > out32
[60.936ms ]           < out32
[60.936ms ]         < call_host_function_without_returning_result
[60.937ms ]         > get_host_return_value
[61.002ms ]           > try_pop_shared_input_data_into
[61.002ms ]             Start converting buffer
[61.046ms ]             Finish converting buffer
[61.046ms ]           < try_pop_shared_input_data_into
[61.046ms ]         < get_host_return_value
[61.046ms ]       < call_host_function
[61.05ms  ]     < call_guest_function
[61.05ms  ]     > push_shared_output_data
[61.05ms  ]       Start copy of data
[61.05ms  ]       Finish copy of data
[61.05ms  ]     < push_shared_output_data
[61.05ms  ]   < internal_dispatch_function
[61.051ms ] > halt
2. Azure VM running mshv3 on AMD EPYC 7763 64-Core Processor Does not report an invariant TSC. I am currently investigating if I can get an Azure VM that reports supporting an invariant TSC.
[67µs     ] BLAKE3 hash of binary is 56561e9811e4a1907f2883cf34345cab6e48ad444201d8a9678860f46749dd41
[941µs    ] > entrypoint
[943µs    ]   > load_gdt
[943µs    ]   < load_gdt
[943µs    ]   > load_idt
[943µs    ]     > init_idt
[944µs    ]     < init_idt
[944µs    ]   < load_idt
[945µs    ]   hyperlight_main
[956µs    ] > halt
[1.431ms  ] > dispatch_function
[1.446ms  ]   > internal_dispatch_function
[1.446ms  ]     > try_pop_shared_input_data_into
[1.446ms  ]       Start converting buffer
[1.448ms  ]       Finish converting buffer
[1.448ms  ]     < try_pop_shared_input_data_into
[1.448ms  ]     > call_guest_function
[1.449ms  ]       Calling guest function
[1.45ms   ]       > call_host_function
[1.45ms   ]         > call_host_function_without_returning_result
[1.452ms  ]           > push_shared_output_data
[1.453ms  ]             Start copy of data
[1.453ms  ]             Finish copy of data
[1.453ms  ]           < push_shared_output_data
[1.453ms  ]           > out32
[1.908ms  ]           < out32
[1.908ms  ]         < call_host_function_without_returning_result
[1.909ms  ]         > get_host_return_value
[1.909ms  ]           > try_pop_shared_input_data_into
[1.909ms  ]             Start converting buffer
[1.91ms   ]             Finish converting buffer
[1.91ms   ]           < try_pop_shared_input_data_into
[1.91ms   ]         < get_host_return_value
[1.91ms   ]       < call_host_function
[1.912ms  ]     < call_guest_function
[1.912ms  ]     > push_shared_output_data
[1.912ms  ]       Start copy of data
[1.913ms  ]       Finish copy of data
[1.913ms  ]     < push_shared_output_data
[1.913ms  ]   < internal_dispatch_function
[1.913ms  ] > halt

TODO

  • For now, the timestamp of the traces is not calculated correctly. It needs to be calculated relative to the start of the guest and base of on the invariant TSC frequency.
  • Check the features introduced in the crates to ensure consistency
  • Expand dump_trace functionality to visualize the traces
  • Investigate invariant TSC not being correctly reported in Azure VMs.

@dblnz dblnz added the kind/enhancement For PRs adding features, improving functionality, docs, tests, etc. label Jul 7, 2025
@dblnz dblnz force-pushed the guest-tracing branch 6 times, most recently from 15f9623 to 030a019 Compare July 9, 2025 12:42
@dblnz
Copy link
Contributor Author

dblnz commented Jul 9, 2025

The output data should be changed to something that other tools can open #704.

@dblnz dblnz marked this pull request as ready for review July 9, 2025 16:53
@syntactically
Copy link
Member

Does not report an invariant TSC. I am currently investigating if I can get an Azure VM that reports supporting an invariant TSC.

Just to document this: it was confirmed that the host under a nested MSHV will incorrectly not see the invariant TSC bit, and this should be ignored; on the azure platforms, at least, the TSC is in fact invariant.

Copy link
Contributor

@jsturtevant jsturtevant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work! Could we add a step where the test binaries are executing in the CI pipeline? Right now it looks like none of this code is being exercised since it is all behind feature flag.

@dblnz dblnz force-pushed the guest-tracing branch 3 times, most recently from baf50a8 to af68b04 Compare July 16, 2025 12:55
@dblnz dblnz changed the title [WIP] Enable guest tracing Enable guest tracing Jul 16, 2025
@dblnz dblnz force-pushed the guest-tracing branch 2 times, most recently from 6e23f9d to 61feafb Compare July 17, 2025 12:46
In the future, the outb handler will need to take a Hypervisor
instance in order to be able to access register and memory state of
the VM, so it doesn't make sense for these interfaces to be more
public than the `Hypervisor` trait.  Nobody outside of Hyperlight
seems to use these at the moment, so it's probably simplest to
restrict these to `pub(crate)`.

Signed-off-by: Lucy Menon <[email protected]>
This adds (unused) support for creating trace files for sandboxes and
passing them around to relevant sandbox event handler code. This will
be used for collecting debug trace and profiling information.

Signed-off-by: Lucy Menon <[email protected]>
Signed-off-by: Doru Blânzeanu <[email protected]>
This will be useful in the near future, when it will allow
transforming the exe_info into unwind information without an extra
copy.

Signed-off-by: Lucy Menon <[email protected]>
syntactically and others added 6 commits July 21, 2025 15:36
This adds a new interface which tracing code can use to request the
values of registers from the hypervisor supervising a sandbox.

Signed-off-by: Lucy Menon <[email protected]>
This allows for producing profiles of memory usage.

Signed-off-by: Lucy Menon <[email protected]>
Signed-off-by: Doru Blânzeanu <[email protected]>
Signed-off-by: Doru Blânzeanu <[email protected]>
Copy link
Member

@syntactically syntactically left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! I left a few relatively minor comments/questions inline.

I had one other question about a commit message, so I couldn't put it in the diff comments: in d3c416c, what does the commit message mean by "the Hypervisor checks for the frame Id, to verify what type of exit it is"?

@dblnz
Copy link
Contributor Author

dblnz commented Jul 22, 2025

This looks great! I left a few relatively minor comments/questions inline.

I had one other question about a commit message, so I couldn't put it in the diff comments: in d3c416c, what does the commit message mean by "the Hypervisor checks for the frame Id, to verify what type of exit it is"?

My phrasing could be better, but I was referring to the OutBAction type, that can be a TraceRecord, TraceMemoryFree, TraceMemoryAlloc or a TraceRecordStack in the context of the tracing.
I'll rephrase when I address the comments you left

syntactically
syntactically previously approved these changes Jul 22, 2025
Copy link
Member

@syntactically syntactically left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking really good, thanks!

Please squash the fixup commits (e825f6e onwards) into the original commits for the relevant things before merging.

dblnz added 7 commits July 23, 2025 14:17
…ords

- `hyperlight-guest-tracing` defines a `TraceBuffer` that keeps
  `TraceRecord`s that the guest issues. When the buffer capacity is
  reached, it automatically issues an `Out` instruction with the
  corresponding info for the host to retrieve the buffer.
- The guest can issue `TraceRecord`s by using the
  `hyperlight-guest-tracing-macro` crate that pushes new records to the
  buffer.
- `hyperlight_common` contains the definitions for the frame types a
  guest can send to the host using the `Out` instruction.

Signed-off-by: Doru Blânzeanu <[email protected]>
- When an `Out` instruction is intercepted, the Hypervisor checks for
  the port to verifies what type of exit was encoded.
  Based on this, when a trace record type is received, we copy the array
  of trace records from the guest's memory, calculate the timestamp
  based on the cycles returned by the guest's RDTSC and write the record
  to the trace file.

Signed-off-by: Doru Blânzeanu <[email protected]>
- Add traces wherever we think it might be of use to help us better
  profile the codebase.
- Add flush instructions before halting or running an `OutB` instruction
  to ensure the records are sent before the guest exits due to an error
  or a normal exit

Signed-off-by: Doru Blânzeanu <[email protected]>
- Update CI to test the tracing functionality
- Add features for the Justfile to be able to build guests with features

Signed-off-by: Doru Blânzeanu <[email protected]>
Copy link
Member

@syntactically syntactically left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@dblnz dblnz enabled auto-merge (rebase) July 23, 2025 12:08
auto-merge was automatically disabled July 23, 2025 12:09

Base branch requires signed commits

@dblnz dblnz merged commit 26a8954 into hyperlight-dev:main Jul 23, 2025
33 checks passed
@dblnz dblnz deleted the guest-tracing branch July 23, 2025 12:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/enhancement For PRs adding features, improving functionality, docs, tests, etc.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants